| Summary: | pods are unable to delete after storage endpoints were down | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qian Cai <qcai> |
| Component: | Storage | Assignee: | Bradley Childs <bchilds> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Liming Zhou <lizhou> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 3.5.0 | CC: | agoldste, gouyang, jhou, jsafrane, lizhou, mmcgrath, qcai |
| Target Milestone: | --- | Keywords: | Extras |
| Target Release: | --- | ||
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-03-20 02:54:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Qian Cai
2016-03-15 17:29:45 UTC
Jan, any idea what can be wrong?

I believe this is expected behaviour. Once the endpoint is down, requests for deletion (and other operations) are not forwarded to the daemons (the glusterfs/cephrbd receivers). Maybe the pod worker just waits for the volume to get deleted/unattached so the data inside it is stored/garbage-collected properly. CAI, does the pod get deleted once the endpoints are back online?

(In reply to Jan Chaloupka from comment #3)
> CAI, does the pod get deleted once the endpoints are back online?

The endpoints were accidentally gone. We have to wait for someone to bring them back.

Looking at the Gluster volume plugin, it just calls umount() and that probably does not finish in time. That should be fixed by the upcoming Attach and Mount controllers: they should let the pod die and periodically try to unmount or detach any stale volumes until it finally succeeds. Adding Sami to cc:.

Once this happened, it became a mess:
1) The zombie pods are just stuck there. They won't disappear even after restarting all daemons on the master/node and manually unmounting the volumes.
2) On the node, "systemctl stop kubelet" hangs, apparently while trying to unmount the volumes.
3) Rebooting the node hangs here:
[  OK  ] Stopped Create Static Device Nodes in /dev.
         Stopping Create Static Device Nodes in /dev...
[  OK  ] Stopped Remount Root and Kernel File Systems.
         Stopping Remount Root and Kernel File Systems...
[  OK  ] Reached target Shutdown.
[1366830.047161] libceph: connect 10.70.43.76:6804 error -101
[1366830.049112] libceph: osd4 10.70.43.76:6804 connect error
Even the soft-reboot button in the OpenStack management UI won't recover the instance; we have to ask the OpenStack admin to fix it.
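For anyone hitting the same stuck-teardown state, a minimal cleanup sketch is below. This is not the actual fix (that lands with the upstream Attach/Mount controller change); it only lists kubelet-managed glusterfs/rbd mounts and lazy-unmounts them so pod teardown can finish. The pods directory path is an assumption based on typical OCP 3.x installs; adjust it for your deployment.

```shell
#!/bin/sh
# Assumed OCP 3.x kubelet volumes location; verify on your node.
PODS_DIR=/var/lib/origin/openshift.local.volumes/pods

# List mountpoints of glusterfs/rbd volumes under the kubelet pods dir.
# /proc/mounts fields: device mountpoint fstype options dump pass
stale_mounts() {
  grep -E 'glusterfs|rbd' /proc/mounts | awk '{print $2}' | grep "^$PODS_DIR"
}

stale_mounts | while read -r mnt; do
  echo "lazy-unmounting $mnt"
  umount -l "$mnt"  # lazy unmount: detach now, kernel cleans up when no longer busy
done
```

Lazy unmount (`umount -l`) is used deliberately: a plain `umount` against a dead glusterfs/ceph server can block indefinitely, which is exactly the hang described above.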
Suspect this also affects NFS or iSCSI. https://github.com/kubernetes/kubernetes/issues/24819 looks like the same problem.

Guohua, 24819 is a very specific issue that is unrelated to remote storage. I believe the actual issue is https://github.com/kubernetes/kubernetes/issues/27463, which was recently fixed in Kubernetes upstream. CAI, can you try to reproduce it again and write down a regression test for it?

I am setting needinfo for the QA contact to answer the question in comment #10.

Cannot reproduce on OCP v3.5.0.52. The pod was removed within its grace period when its endpoint was removed. Lowering severity/priority.

The test verified the issue is gone with OCP v3.5.0.52; the bug can be closed with CURRENTRELEASE.
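The regression check QA describes (pod gone within its grace period after the endpoints disappear) could be scripted roughly as follows. This is a sketch, not the actual QA test: the object names (`glusterfs-cluster`, `gluster-pod`) and the 60-second bound are made up for illustration.

```shell
#!/bin/sh
# wait_gone: poll a command until it fails (i.e. the resource is gone),
# or give up after $1 seconds. Returns 0 if gone, 1 on timeout.
wait_gone() {
  timeout=$1; shift
  elapsed=0
  while "$@" >/dev/null 2>&1; do
    [ "$elapsed" -ge "$timeout" ] && return 1  # still present: likely a regression
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Against a live cluster it might look like this (hypothetical names,
# commented out since it requires a logged-in oc session):
# oc delete endpoints glusterfs-cluster
# oc delete pod gluster-pod
# wait_gone 60 oc get pod gluster-pod || echo "BUG: pod stuck past grace period"
```

With the fix in place, `wait_gone` should return well inside the pod's termination grace period (30 seconds by default) rather than timing out.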