Description of problem:

journalctl shows an orphaned pod:

Aug 28 10:33:51 sxxxxx atomic-openshift-node[14947]: E0828 10:33:51.246241 14947 kubelet_volumes.go:140] Orphaned pod "0f59ef88-b4c1-11e9-afd0-00505680343f" found, but volume paths are still present on disk: There were a total of 15 errors similar to this. Turn up verbosity to see them.

It is not clear why the issue started: there was no recent node outage, although the docker daemon on the node does usually hang. No persistent storage is used. This was supposed to be fixed in 3.10: https://access.redhat.com/errata/RHBA-2018:1816

Version-Release number of selected component (if applicable):
3.11.104
docker 1.13.1-96

There is a workaround (https://access.redhat.com/solutions/3237211):

rm -rf /var/lib/origin/openshift.local.volumes/pods/<pod-id>/
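The workaround above can be sketched as a small script. This is only an illustration, not part of the linked solution: the pod UID is the one from the log message, the volume root is the OpenShift 3.11 default and may differ on your node, and the actual rm is left commented out until you have verified the pod no longer exists in the cluster.

```shell
#!/bin/sh
# Sketch of the workaround (https://access.redhat.com/solutions/3237211).
# POD_UID is the UID from the "Orphaned pod" log line; VOLUME_ROOT is the
# 3.11 default and is an assumption -- adjust for a non-default --root-dir.
POD_UID="${POD_UID:-0f59ef88-b4c1-11e9-afd0-00505680343f}"
VOLUME_ROOT="${VOLUME_ROOT:-/var/lib/origin/openshift.local.volumes/pods}"

POD_DIR="$VOLUME_ROOT/$POD_UID"
if [ -d "$POD_DIR" ]; then
  echo "Orphaned directory found: $POD_DIR"
  # Uncomment only after confirming the pod is gone from the cluster:
  # rm -rf "$POD_DIR"
else
  echo "No directory for pod $POD_UID under $VOLUME_ROOT"
fi
```

Restarting the atomic-openshift-node service afterwards lets the kubelet re-scan the directory and stop logging the error.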
The attached logs contain "Orphaned pod "0f59ef88-b4c1-11e9-afd0-00505680343f" found", however, they do not contain any details about how the pod was deleted or why its volumes are orphaned.

To get rid of the message in 3.11, it should be enough to restart the atomic-openshift-node service to clean up the orphans. If that does not work, deleting the orphans manually (https://access.redhat.com/solutions/3237211) will definitely help.

To find the root cause, we need to know:

1. What happened to the pod when it was deleted (was the node rebooted? was the openshift-node service restarted?). Node logs from when 'Orphaned pod XYZ' appeared for the first time, plus 10-20 minutes before that, would be best.

2. *What* volume was orphaned. A simple "find /var/lib/origin/openshift.local.volumes/pods/0f59ef88-b4c1-11e9-afd0-00505680343f" is a good start. Not all volumes that Kubernetes uses are persistent; for example, each pod gets a Secret volume with an API token:

/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/volumes
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/volumes/kubernetes.io~secret
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir/wrapped_default-token-tpsl2
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir/wrapped_default-token-tpsl2/ready

With that listing, we can see that a secret volume was orphaned and we know where to look.
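To run that inspection over every pod directory still on disk at once, a minimal sketch (assuming the default 3.11 volume root; adjust for a non-default --root-dir) that prints, per pod UID, which volume plugin types are present:

```shell
#!/bin/sh
# For each pod directory left on disk, list the volume plugin types
# (kubernetes.io~secret, kubernetes.io~nfs, ...) found under volumes/.
# PODS_DIR is the OpenShift 3.11 default path and is an assumption.
PODS_DIR="${PODS_DIR:-/var/lib/origin/openshift.local.volumes/pods}"

for pod in "$PODS_DIR"/*; do
  [ -d "$pod" ] || continue
  # Plugin directories sit one level under volumes/; strip the path prefix.
  plugins=$(find "$pod/volumes" -mindepth 1 -maxdepth 1 -type d 2>/dev/null |
            sed 's|.*/||' | tr '\n' ' ')
  echo "$(basename "$pod"): ${plugins:-no volume plugin dirs}"
done
```

Pod UIDs that show only kubernetes.io~secret or kubernetes.io~empty-dir entries are the non-persistent leftovers described above; anything else points at the volume plugin to investigate.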
It looks like this is an issue with Ceph CSI. I am not sure whether a patch will be backported to 3.11. I'm reassigning this issue to Storage to track. Issue: https://github.com/kubernetes/kubernetes/issues/60987#issuecomment-638750828 Potential upstream PR: https://github.com/ceph/ceph-csi/pull/1134
I am closing this bug since I haven't heard any update from the reporter. The storage team is aware that this issue is not 100% fixed for certain drivers, but since it does not affect cluster stability, we aren't actively working on it. The reason I say the fix is specific to the volume driver is that after a node/kubelet restart, volume reconstruction from disk depends on the plugin type, and that is the logic that needs tightening so that volumes can be cleaned up. If this re-occurs, we would appreciate a new bug naming the specific volume driver that is causing the problem.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.