Bug 2014083
| Summary: | When a pod is force-deleted, the volume may not be unmounted from the node, which causes mkfs to fail even though the PV is set to Available | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mario Vázquez <mavazque> |
| Component: | Node | Assignee: | Peter Hunt <pehunt> |
| Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | high | CC: | alosadag, aos-bugs, augol, bzhai, cback, colleen.o.malley, ealcaniz, ehashman, harpatil, igreen, jbrassow, jfindysz, jsafrane, keyoung, kir, kkarampo, krizza, mcornea, nagrawal, obulatov, openshift-bugs-escalate, peasters, pehunt, rphillips, rsandu, vlaad |
| Version: | 4.8 | Flags: | ehashman: needinfo- |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-01-17 17:59:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description Mario Vázquez 2021-10-14 12:57:33 UTC
@kir any news on this? If the fix was included in 4.6, should it not already be included in 4.8?

It looks like the runc fix mentioned earlier is unrelated. From my perspective, what happens here is that some processes are still left in the cgroup, which is why it cannot be removed. Volumes are out of runc's scope, but the volume might be left mounted for the same reason the cgroup can't be removed -- some processes are still using it. I'd use lsof or similar tools to investigate further.

If a file system cannot be unmounted because a process is using it, there isn't much the file system or anything below it can do - it has to be resolved by the process releasing the FS first, then unmounting, and so on. It is possible that this is storage related if:

1) the storage is blocking/hung and the process that is preventing the unmount cannot proceed. This could happen, for example, if a device-mapper device was suspended (I don't immediately see an indication of this).

2) a FS or storage tool /is/ the process preventing the unmount. It could be hung due to locking or another software bug. I find this very unlikely, as these tools most often do not utilize the storage they are administering.

Comment 9 seems like the best suggestion - identify the process(es) that are running and preventing the unmount. There are nastier hacks that could be performed (like remapping the PV to 'error'), but that would be the worst case, I think.

I now suspect this is a dup of https://bugzilla.redhat.com/show_bug.cgi?id=2003206. According to https://bugzilla.redhat.com/show_bug.cgi?id=2014083#c18, we're hitting a deadlock during stop. I could verify this if we could gather the goroutine stacks of a running instance: https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md#printing-go-routines

Setting needinfo for the information requested in https://bugzilla.redhat.com/show_bug.cgi?id=2014083#c41

A clarification comment: if we get into this situation where a pod can't be cleaned up, I'll need someone to ssh to the node, run the commands described in https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md#printing-go-routines, and post the file written to /tmp. That will help me determine whether the issue we're seeing is the same as the suspected duplicate.

(In reply to Peter Hunt from comment #44)
> a clarification comment: if we get into this situation where a pod can't be
> cleaned up, I'll need someone to ssh to the node and run commands described
> in
> https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md#printing-go-routines
> and post the file written to /tmp. That will help me determine
> whether the issue we're seeing is the same as the suspected duplicate.

Hi Peter,

I will contact the customer and get you that stack trace asap. The case linked to this bz was put on hold due to the possible duplicate with the namespace termination issue mentioned above.

br,
Chris

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
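To make the cgroup diagnosis in the thread above concrete: a cgroup directory cannot be removed while its cgroup.procs file still lists PIDs. Below is a minimal Go sketch of that check; the cgroup path is a hypothetical example (a real pod cgroup path depends on the cgroup driver and pod UID), not one taken from this bug.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical cgroup v1 path; adjust for cgroup v2 (unified hierarchy).
	procsFile := "/sys/fs/cgroup/memory/kubepods.slice/example-pod.scope/cgroup.procs"

	data, err := os.ReadFile(procsFile)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read cgroup.procs:", err)
		os.Exit(1)
	}

	pids := strings.Fields(string(data))
	if len(pids) == 0 {
		fmt.Println("cgroup is empty; removing it should succeed")
		return
	}
	for _, pid := range pids {
		// /proc/<pid>/comm names the leftover process, which is the
		// starting point for figuring out why it was left behind.
		comm, err := os.ReadFile("/proc/" + pid + "/comm")
		if err != nil {
			fmt.Printf("pid %s (comm unavailable: %v)\n", pid, err)
			continue
		}
		fmt.Printf("pid %s: %s", pid, string(comm)) // comm ends with a newline
	}
}
```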
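Similarly, the "identify the process(es) preventing the unmount" suggestion usually starts with `lsof +D <mountpoint>` or `fuser -vm <mountpoint>` on the node. As an illustration only, here is a rough Go equivalent that scans /proc/<pid>/fd for open files under a placeholder path; the mount prefix is an assumption, and a full lsof also inspects each process's cwd, root, and memory maps, which this sketch skips.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Placeholder prefix: on an OpenShift node the volume would live
	// somewhere under /var/lib/kubelet/pods/<pod-uid>/volumes/...
	mount := "/var/lib/kubelet/pods"

	procs, err := os.ReadDir("/proc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range procs {
		pid := p.Name()
		if !p.IsDir() || pid[0] < '0' || pid[0] > '9' {
			continue // skip non-PID entries such as /proc/sys
		}
		fdDir := filepath.Join("/proc", pid, "fd")
		fds, err := os.ReadDir(fdDir)
		if err != nil {
			continue // process exited, or permission denied
		}
		for _, fd := range fds {
			target, err := os.Readlink(filepath.Join(fdDir, fd.Name()))
			if err == nil && strings.HasPrefix(target, mount) {
				fmt.Printf("pid %s holds %s open\n", pid, target)
			}
		}
	}
}
```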
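The goroutine-dump procedure Peter asked for is documented in the linked CRI-O tutorial. As a generic illustration of the technique (not CRI-O's actual code), a Go daemon can write the stacks of all its goroutines to a file under /tmp when it receives SIGUSR1; in a deadlocked stop path, the dump shows which goroutines are blocked on which locks.

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	go func() {
		for range sigs {
			f, err := os.Create("/tmp/goroutine-stacks.log")
			if err != nil {
				continue
			}
			// debug=2 writes the full stack of every goroutine, which
			// is what exposes a goroutine stuck waiting on a lock.
			pprof.Lookup("goroutine").WriteTo(f, 2)
			f.Close()
		}
	}()

	select {} // stand-in for the real daemon's work
}
```

Triggering the dump is then a matter of sending `kill -USR1 <pid>` to the process and collecting the file, which matches the "post the file written to /tmp" request above.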