Bug 1753160
| Summary: | [3.11] Pods in orphaned state | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Camino Noguera <mnoguera> |
| Component: | Storage | Assignee: | Hemant Kumar <hekumar> |
| Status: | CLOSED WONTFIX | QA Contact: | Qin Ping <piqin> |
| Severity: | low | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.11.0 | CC: | aanjarle, aos-bugs, aos-storage-staff, bbennett, ckoep, jokerman, jsafrane, mtleilia, nchoudhu, rphillips, sburke, scuppett, ssadhale, swachira, vjaypurk |
| Target Milestone: | --- | Flags: | mnoguera: needinfo-, hekumar: needinfo? (mtleilia) |
| Target Release: | 3.11.z | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-06-23 16:28:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Camino Noguera
2019-09-18 09:48:27 UTC
The attached logs contain "Orphaned pod "0f59ef88-b4c1-11e9-afd0-00505680343f" found"; however, they do not contain any details about how the pod was deleted and why its volumes are orphaned.

To get rid of the message in 3.11, it should be enough to restart the atomic-openshift-node service to clean up the orphans. If that does not work, deleting the orphans manually (https://access.redhat.com/solutions/3237211) will definitely help.

To find the root cause, we need to know:

1. What happened to the pod when it was deleted (was the node rebooted? was the openshift-node service restarted?). Node logs from the first occurrence of "Orphaned pod XYZ", plus the 10-20 minutes before it, would be best.

2. *What* volume was orphaned. A simple "find /var/lib/origin/openshift.local.volumes/pods/0f59ef88-b4c1-11e9-afd0-00505680343f" is a good start. Not all volumes that Kubernetes uses are persistent; for example, each pod gets a Secret volume with an API token:

   /var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/volumes
   /var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/volumes/kubernetes.io~secret
   /var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins
   /var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir
   /var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir/wrapped_default-token-tpsl2
   /var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir/wrapped_default-token-tpsl2/ready

   With that listing, we can see that a secret volume was orphaned and we know where to look. (Command sketches for both points are at the end of this report.)

It looks like this is an issue with Ceph CSI. I am not sure if a patch will be backported into 3.11. I'm going to reassign this issue to storage to track.

Issue: https://github.com/kubernetes/kubernetes/issues/60987#issuecomment-638750828
Potential Upstream PR: https://github.com/ceph/ceph-csi/pull/1134

I am closing this bug since I haven't heard any update from the reporter. The storage team is aware that this issue is not 100% fixed for certain drivers, but since it does not affect the stability of a cluster, we aren't actively working on it. The reason I say the fix is specific to the volume driver is that, after a node/kubelet restart, the volume reconstruction from disk depends on the plugin type, and that is the logic that needs tightening so that the volumes can be cleaned up. If this re-occurs, we would appreciate a new bug identifying the specific volume driver that is causing the problem.
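For reference, a minimal sketch of how to gather the data requested in points 1 and 2 above, assuming the node logs to the systemd journal through the atomic-openshift-node unit and uses the default 3.11 volume root under /var/lib/origin/openshift.local.volumes; the pod UID is taken from the "Orphaned pod" message and is only an example here:

```
# Pod UID from the "Orphaned pod ... found" message (example from this bug)
POD_UID=0f59ef88-b4c1-11e9-afd0-00505680343f

# 1. Node logs around the first occurrence of the message
journalctl -u atomic-openshift-node | grep "Orphaned pod" | head

# 2. What is left on disk for the orphaned pod; the directory names under
#    volumes/ and plugins/ (kubernetes.io~secret, kubernetes.io~empty-dir, ...)
#    show which volume plugin the leftovers belong to
find /var/lib/origin/openshift.local.volumes/pods/${POD_UID}
```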
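And a hedged sketch of the cleanup path mentioned above: restart the node service first, and fall back to manual removal only as described in the linked solution article, after confirming nothing is still mounted under the pod directory:

```
# First attempt: restarting the node service re-runs the orphan cleanup
systemctl restart atomic-openshift-node

# Last resort: manual removal, only if nothing is still mounted under the
# orphaned pod directory (see https://access.redhat.com/solutions/3237211)
if ! mount | grep -q "${POD_UID}"; then
    rm -rf "/var/lib/origin/openshift.local.volumes/pods/${POD_UID}"
fi
```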