Bug 1609703
Summary: | APP pod unable to start after target port failure in cases where single paths are mounted on APP pods (BZ#1599742) | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Neha Berry <nberry>
Component: | Storage | Assignee: | Jan Safranek <jsafrane>
Status: | CLOSED ERRATA | QA Contact: | Liang Xia <lxia>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 3.10.0 | CC: | aos-bugs, aos-storage-staff, apanagio, bchilds, hchiramm, jhou, jsafrane, lxia, madam, nberry, vlaad
Target Milestone: | --- | Flags: | lxia: needinfo-
Target Release: | 3.10.z | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1609788 1637413 1637422 (view as bug list) | Environment: |
Last Closed: | 2018-11-11 16:39:10 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1596021, 1598740, 1609788, 1637413, 1637422 | |
Description (Neha Berry, 2018-07-30 08:17:52 UTC)
> Please let me know if I can re-use my setup or it is needed by you for further analysis.
I downloaded the logs (thanks!), you can destroy the machines.
3.10.z PR: https://github.com/openshift/ose/pull/1431

> b) Mount existed for block6 on both the nodes for more than 2H. Then finally we saw the iqn was logged out from dhcp47-196.
In final_journalctl I can see that pod cirrosblock6-1-qwkp9_glusterfs was deleted, but something prevented its docker container from exiting; the log is full of these messages:

```
cirrosblock6-1-qwkp9_glusterfs(184cbc6b-c754-11e8-8b48-005056a50953)" has been removed from pod manager. However, it still has one or more containers in the non-exited state. Therefore, it will not be removed from volume manager.
```

Therefore kubelet did not unmount its volumes. I'd suggest creating a separate bug and consulting the pod team.
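As a rough sketch, this state can be confirmed on the node itself (assuming a Docker-based OpenShift 3.10 host; the pod name and UID below come from the log message above, and the atomic-openshift-node unit name may differ on your installation):

```
# Any container of the deleted pod that is not in the Exited state
# keeps the pod in kubelet's volume manager.
docker ps --all | grep cirrosblock6-1-qwkp9

# Check whether the pod's volumes are still mounted; the UID comes from
# the "has been removed from pod manager" message.
mount | grep 184cbc6b-c754-11e8-8b48-005056a50953

# Watch the node log for further volume manager messages.
journalctl -u atomic-openshift-node -f | grep "volume manager"
```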
In 3.10 and 3.11 we included these upstream PRs:

* https://github.com/kubernetes/kubernetes/pull/63176 - deletes LUNs that are not needed - avoids excessive scanning for LUNs that kubelet does not need - prerequisite of the other PRs
* https://github.com/kubernetes/kubernetes/pull/67140 - waits up to 10 seconds for multipath devices to appear - this prevents a single path from being mounted instead of multipath
* https://github.com/kubernetes/kubernetes/pull/69140 - fixes a regression introduced by the above PRs
* https://github.com/kubernetes/kubernetes/pull/68141 - re-tries attaching missing paths until at least two of them are available (or the plugin has tried 5 times and given up) -> at least two paths (= multipath) should always be used unless something is really wrong with the target

Summed together, you should always get some multipath, unless one target is in really bad shape.

Verified on v3.10.57:

- A multipath device is created instead of a single path.
- Login sessions are cleaned up when the Pod is deleted.
- With multiple portals, if one path is unavailable, it retries 5 times:

```
iscsi: failed to sendtargets to portal 10.10.10.10:3260 output: iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connection login retries (reopen_max) 5 exceeded
```

- With multiple portals, if only one portal is available, the Pod stays stuck at 'ContainerCreating' waiting for multipath to be set up.
- If only one portal is specified, a single path is set up.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2709
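For reference, a minimal sketch of commands that can be used on the node to spot-check the verification points above (this assumes iscsiadm and device-mapper-multipath are installed on the node; `<pod-name>` and `<project>` are placeholders):

```
# With multiple portals configured and a healthy target, expect one
# iSCSI session per reachable portal.
iscsiadm -m session

# The volume should be backed by a dm-multipath device,
# not a single /dev/sdX path.
multipath -ll

# If only one of several portals is reachable, the pod stays in
# ContainerCreating; the events show kubelet waiting for the multipath device.
oc describe pod <pod-name> -n <project>
```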